- GetAllHTML v1.00 Copyright 1998-2002 Christopher S Handley
- ==========================================================
- Latest News
- -----------
- This is (probably) my final release, as I have not updated the code in two
- years, and it seems to work very well for me. Since it works well, I have
- finally given it v1.00 status!
-
- Should now work on even more HTML pages due to minor improvements.
-
-
- If you cannot get ARexx to work, please read my warning below about text
- editors.
-
- Introduction
- ------------
- Have you ever visited a cool web site & wanted to keep a copy of some/all of it,
- but it would take ages to find & download all the respective pages/files?
-
- This is the answer!
-
- You supply this ARexx script with the start page URL, and a destination
- directory (which should be empty), and maybe a few other options - and off it
- goes! Note that it needs HTTPResume v1.3+ to work (get from Aminet).
-
- The idea for this came from a PC Java program called PageSucker - sadly it is
- over 1Mb in size & buggy (& can't be run on the Amiga, yet). Although my
- implementation may not have quite as many features, it does do the job quite
- fast, & has relatively low memory overheads.
-
- Requirements
- ------------
- o HTTPResume v1.3+ (but v1.7 is recommended)
-
- o An Amiga capable of running ARexx programs
-
- o Libs:Rexxsupport.library
-
- o Modem with TCP/IP stack (like Genesis, Miami, or AmiTCP)
-
- Usage
- -----
- 1.Before running it for the first time you must (text) edit it (say using C:ED)
- so that it knows where your copy of HTTPResume is located. Go to line 19 where it says:
- HTTPResume='Programs:Utils/Comms/HTTPResume'
- Alter the file path between the 'quotes' to where you keep HTTPResume, and save.
-
- 2.Run your TCP/IP stack (e.g.AmiTCP/Genesis or Miami).
-
- 3.Run it from a Shell using:
-
- Sys:RexxC/Rx GetAllHTML arguments
-
- Where the arguments are:
-
- "URL"/A, "DestDir"/A, NOASK/S, ARC/S, PIC/S, RESUME/S, PAUSE/S, DEPTH=/N/K,
- NOBASEINDEX/S, PORT=/K, BASEURL=/K, BROKENLINKS/S
-
- Note - The destination dir must be empty of any previous attempt at downloading
- that web page (unless using the RESUME switch). Note also that both URL &
- DestDir *must* be enclosed in "double quotes" - and that the BASEURL must NOT
- be surrounded by quotes!
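-
- For example (the URL & download directory here are purely illustrative):
-
-   Sys:RexxC/Rx GetAllHTML "http://example.com/docs/" "Work:Docs/"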
-
- *Note* you may have several GetAllHTMLs & HTTPResumes running at the same time
- (not on the same URL!), and if you use the PORT argument then you will need
- HTTPResume running first.
-
- See the file GetAllHTML_ex.script for an example usage - it will download all of
- Squid's on-line artwork (hope he gets a few more sales of his wonderful 'other
- world' artwork from this :-).
-
- Behaviour
- ---------
- Its default behaviour is to find all links in each HTML page, and download them
- if (see the sketch after this list):
-
- -the URL path is a sub-directory of the original URL path; this stops
- downloading irrelevant pages on different topics, different servers, etc.., AND
-
- -if they are HTML pages (name ends in .html, etc), OR
-
- -if not HTML pages then it will ask if the file should be downloaded; if the
- answer does not begin with an "n" then it does (download). If the file has the
- same suffix as the last positively confirmed download, then it 'intelligently'
- assumes it should be downloaded...
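-
- In ARexx-like pseudocode, the decision above looks roughly like this (an
- illustrative sketch only, with made-up variable names - it is not the actual
- GetAllHTML code):
-
-   /* Illustrative sketch: only consider URLs under the original (base) path */
-   IF Pos(baseurl, url) = 1 THEN DO
-       IF suffix = 'HTML' | suffix = 'HTM' THEN
-           download = 1                        /* pages are always fetched */
-       ELSE DO
-           SAY 'Download' url '?'              /* otherwise ask the user   */
-           PULL answer
-           IF Left(answer, 1) ~= 'N' THEN download = 1
-       END
-   END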
-
- This behaviour is modified by various switches:
-
- RESUME - Should downloading of pages have been interrupted (maybe a crash), run
- GetAllHTML with *exactly* the same options, except with this switch too.
- It will take a while to reach the same place as all previous HTML pages
- must be scanned - and some additional memory usage is incurred.
-
- I suggest you don't go on-line until it has reached the previously
- interrupted point (it waits for you to press return).
-
- *NOTE* that this mode is flawed due to the way GetAllHTML.rexx works,
- so that it will sometimes think it has reached the previously finished
- point when in fact it has not. Still, RESUME is very useful! And an
- AmigaE version would fix this.
-
- PIC - Will identify links to pictures & download them rather than ask.
-
- ARC - Will identify links to archives & download them rather than ask.
-
- NOASK - Do not ask the user whether a file should be downloaded; assume not.
-
- PAUSE - DO ask the user to "press <return>" if we get an empty URL or a file
- could not be downloaded. The RESUME function always asks this anyway.
-
- TERSE - Only outputs very important text, so it won't report strange URLs,
- failure to download files, non-http links, etc...
-
- PORT - Supplies the ARexx port of a *running* HTTPResume; if supplied then
- does not try to launch HTTPResume from AmigaDOS. See *Note* below. If
- no port name is supplied then it sees if the current ARexx port is
- already set to some HTTPResume - if not then an error is generated, else
- it just uses that port.
-
- DEPTH - This allows you to specify how many URL links to follow in sequence;
- i.e.the depth of the search. DEPTH=2 means download only the links from
- the original page, and so on for DEPTH=3, etc.. See *Note* below. If
- no number is supplied then the user is asked for a number.
-
- NOBASEINDEX - This prevents GetAllHTML from following links to the index of the
- original URL path. Is useful for web sites which are fairly 'flat'
- (ie.do not have many sub-directories), which also have banners at the
- top of pages to take you straight to the site's main page (which will
- have links you are not interested in). Probably very rarely needed.
-
- BASEURL - This allows you to override the semi-intelligent default, and tell
- GetAllHTML the base URL - that is, what the URL must start with for it
- to even consider downloading it. This is useful if you wish to download
- from a particular page deep down the directory structure, but which
- references images (or maybe pages) that are further up.
-
- BROKENLINKS - This causes attempted downloading of pages that are not
- sub-directories of the original URL. These will be downloaded to "T:"
- (which is usually "RAM:T") & then deleted. If the download fails then you
- will be told that there was a broken link. See suggested uses below.
-
- *Note* that both DEPTH & PORT must be followed by an equals sign ("=") and then
- the data, _without_ any spaces between anything. This is due to a limitation of
- ARexx, which an AmigaE version would fix.
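-
- For example, this is accepted (the port name MYPORT is only illustrative - use
- whatever name your running copy of HTTPResume actually has):
-
-   Sys:RexxC/Rx GetAllHTML "http://example.com/" "Work:Dl/" DEPTH=3 PORT=MYPORT
-
- whereas "DEPTH = 3" (with spaces around the equals sign) would be rejected.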
-
- SOLUTIONS TO FREQUENT PROBLEMS
- ------------------------------
- 1.If ARexx complains there is an error in the GetAllHTML code (when you try to
- run it), then it is VERY likely that the problem comes from editing GetAllHTML
- in a text editor.
-
- Many text editors seem to be EXTREMELY badly coded, so that they can't handle
- the very long lines I have in GetAllHTML; they usually end up splitting a line
- in 2, or even deleting everything past a certain point on the line...
-
- I strongly suggest buying CygnusEd (v3.5 or v4), as this is what I use, and I
- have never had any problems with it; it can even load in binary files, and save
- them back without losing anything!
-
- Suggested uses
- --------------
- 1.There's a big web site with lots of pictures/music/archives/information that
- interests you. Depending on what you want, you will need to use the PIC, ARC,
- and NOASK switches.
-
- For instance, if you are only interested in pictures then use PIC & NOASK. If
- you are only interested in archives then use ARC & NOASK. If you are interested
- in something else than pictures or archives (in addition to the web pages), then
- don't use any of those three switches - GetAllHTML will ask you if something
- should be downloaded.
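-
- So a pictures-only grab might look something like this (the URL & directory are
- again only illustrative):
-
-   Sys:RexxC/Rx GetAllHTML "http://example.com/art/" "Work:Art/" PIC NOASK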
-
-
- 2.You have your own home-page on the web, and it includes lots of links to other
- sites, which would take hours to check by hand. Point GetAllHTML at
- your web site with the BROKENLINKS switch. Note it will never try to download a
- link twice, even withOUT using RESUME.
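-
- Such a link-checking run might look something like this (illustrative URL &
- directory again):
-
-   Sys:RexxC/Rx GetAllHTML "http://example.com/me/" "Work:Chk/" BROKENLINKS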
-
- In fact, if you have your web site in a directory on your HD, then you could
- RESUME with that directory as your download directory; this will be MUCH faster
- since none of your pages will (or should) be downloaded :-)) . The first time
- you try this, do it on a back-up copy to ensure GetAllHTML does not do anything
- strange (I won't be held responsible for extra files 'magically' appearing!).
-
-
- 3.I haven't tried this out myself, but it should be really cool:
-
- If you have a favourite news page then you can use GetAllHTML to download just
- the latest news. Suggest use NOASK, possibly with PIC if you want pictures too.
- You will obviously need to delete (or rename) the main news-page file, to force
- GetAllHTML to download the latest index which contains links to the new news.
-
- I don't think you need to use RESUME, but...
-
-
- Any other ideas?
-
- Bugs & other unwelcome features
- -------------------------------
- o The RESUME feature *may* cause some files to be missed, but this depends on
- how ARexx treats funny characters for variable names. An AmigaE version would
- fix any problems.
-
- o Interpretation of the HTML & URLs is based on observation rather than any
- specification of these standards - thus there will probably be rare cases in
- which it may misinterpret them; as long as these are reported (along with the
- responsible HTML file(s)), fixes will probably be forthcoming.
-
- But it really seems to work fine for the sites I have tried :-)
-
- o You cannot go above a depth of 42; this is to protect against an ARexx
- limitation which will cause problems above a depth of about 45. An AmigaE
- version would fix this.
-
- Technical info
- --------------
- GetAllHTML uses a depth-first tree search, via recursion, and uses the existence of
- (downloaded) files as a super-fast (for ARexx!) record of whether a page has
- been visited or not.
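-
- Roughly, the idea is this (an illustrative sketch only - FetchAndScan & the
- variable names are made up, not the actual code):
-
-   /* Sketch: only fetch & recurse into a page we have not already saved */
-   IF ~Exists(destdir || localname) THEN
-       CALL FetchAndScan url, depth + 1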
-
- When RESUMEing, existence of files cannot be used to record if a page has been
- visited, so an alternate method is used - this is slower, and could fail with
- certain combinations of strangely named URLs (very unlikely); a far slower
- method would avoid this, but was considered unnecessary.
-
- I used the INTERPRET command to do some magic with ARexx to make the arbitrarily
- long linked-lists (really branches) possible - they were required for storing
- what pages have been visited. Although this method is not very memory efficient
- (many duplicate entries of the same URL), it is quite fast - and more
- importantly it *works* in ARexx. I had thought it would be virtually impossible
- to make arbitrarily extended link-lists in ARexx, but the interpretive nature of
- ARexx means you can effectively create ARexx commands on-the-fly.
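-
- As a rough illustration of the trick (made-up names, not the actual GetAllHTML
- code):
-
-   /* Grow a "branch" name one level at a time, then create that variable */
-   /* at run-time by executing an assignment statement built as a string. */
-   branch = 'VISITED'
-   DO i = 1 TO 3
-       branch = branch || '.' || i
-       INTERPRET branch '= "page' || i || '.html"'
-   END
-   SAY visited.1.2.3    /* prints: page3.html */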
-
- Future
- ------
- I have no plans for the future. It has proven to work very well, and I have not
- updated it in 2 years.
-
- About the Author
- ----------------
- My name is Christopher S Handley. You're not really interested in me are you?!
-
- Contacting the Author
- ---------------------
- Email: cshandley@iee.org
-
- I am not yet sure about giving my snail mail address to all & sundry - sorry
- about that :-( (I know how I felt when people did that before I had email
- access).
-
- Thanks to
- ---------
- o Andrija Antonijevic for HTTPResume
-
- o the Amiga community for sticking with the Amiga, and continuing to innovate.
- Give your backing to KOSH (originally proposed by Fleecy Moss).
-
- o CU Amiga for becoming the greatest Amiga mag over the last year, before
- passing away. I did not like AF's Xmas issue at all (and AF didn't appear to
- like my criticism of it either...)
-
- o whoever designed the Rexx language - Rexx is great for 'user utilities'.
-
- History
- -------
- v1.00 (22-08-02) - Should now handle URLs with "?" in them. Now considers
- SWF (flash) files as pictures. Final version?
- Added anti-spam stuff to email address :(
- v0.66ß (30-01-00) - Updated docs with new email address, warning about text
- editors causing problems, and other small changes.
- v0.65ß (16-10-99) - Now searches from "HREF=" not "<A HREF=" which will give
- dramatic improvement on some pages! Rewrote how "./"s are
- interpreted, which seems to be much better (perfect?).
- Added the NOBASEINDEX switch. Removed uses of Translate()
- since it was doing nothing - so a small speed-up (esp.for
- RESUME).
- v0.64ß (04-04-99) - Put back the 'extra' END that I removed in v0.61 . Now
- BROKENLINKS will always only try to download external links
- once. Removed NOENV argument of HTTPResume so proxy
- settings may work. Minor changes.
- v0.63ß (04-04-99) - Removed spurious non-visible ASCII (27) characters that
- caused some text editors to go loopy.
- v0.62ß (03-04-99) - Add the BROKENLINKS switch. Replaced NOPAUSE by PAUSE
- switch. Now always warns if a file could not be downloaded
- (not just pages). If you used all the arguments then it
- would miss the last one.
- v0.61ß (28-03-99) - Possible fix for RESUME problem done, plus stupidly left an
- extra END where it broke GetAllHTML.
- v0.60ß (27-03-99) - First stand-alone Aminet release. Damn! There were 3 big
- mistakes... (a)some files expected as directories,
- (b)local-path expansion was too complex & probably wrong
- (hope right now), (c)implicit InDeX.hTmL files were not
- scanned for files. Also asked user to press return but
- really wanted a key first!
- v0.55ß (14-12-98) - Damn! All this fast programming has introduced some bugs,
- but they are fixed now; these included the "~" interpretation
- being completely wrong (removed), and a long standing bug
- where a URL beginning with a slash was misunderstood. Also
- added the BASEURL feature which is really useful sometimes.
- v0.54ß (12-12-98) - Given I couldn't download the KOSH pages (www.kosh.net), I
- added basic frame support, and fixed a long standing bug
- where root html pages could appear as empty directories!
- Two more long standing bugs fixed (ARC & PIC switches had
- inverted sense). Add fix for paths with "~" in, so will
- align path correctly. Add semi-intelligence so that won't
- ask about downloading a file with the same suffix as the
- last file that was confirmed. Add the TERSE function.
- v0.53ß (10-12-98) - The DEPTH feature now works, added the PORT feature,
- added the NOPAUSE feature. Fixed long standing bug of NOASK
- not being recognised. Now removes text in URL after "?"s.
- v0.52ß ( 8-12-98) - Basically updated documentation ready for its first Aminet
- release, for when packaged along with HTTPResume. Added an
- untested DEPTH feature in the special v0.52aß.
- v0.51ß (??-??-98) - Internal speed-up (may be much faster downloading small pages)
- - minor fix of arguments given to HTTPResume
- v0.5ß (??-??-98) - Initial release to a few people. No bugs, honest =:)
-